The What - Today’s goal

For this project I decided to implement a few Bayesian data imputation strategies, starting from a standard Multivariate Normal strategy and progressively incorporating data and domain knowledge into the technique. Performance is evaluated by comparing the techniques among themselves and against the actual “missing” data.

The Why - Data exploration

The data used comes from CalCOFI, an organization that conducts quarterly cruises covering the area from north of San Francisco Bay to San Diego, extending from the coast to 500 km offshore across national and international waters, to take samples of ocean water at various depths.

The data used throughout this project is the bottle data, which contains measurements of ocean water at various depths but ignores the specific location at which each sample was taken.

This dataset contains several features, spanning from the most basic “Temperature” and “Salinity” of the water to quality and precision indices for each measure, up to the most extravagant “Phaeop” (the concentration of phaeopigments, non-photosynthetic pigments that are the degradation products of algal chlorophyll pigments). I only selected \(5\) of these features for the analysis:

Loading the data makes it immediately clear why this dataset is suited for this project: it is really big, with its \(892932\) data points, and there is lots of missing data!

##      Depthm           T_degC          O2ml_L           STheta      
##  Min.   :   0.0   Min.   : 1.44   Min.   :-0.01    Min.   : 20.93  
##  1st Qu.:  45.0   1st Qu.: 7.71   1st Qu.: 1.37    1st Qu.: 24.96  
##  Median : 125.0   Median :10.07   Median : 3.45    Median : 25.99  
##  Mean   : 224.6   Mean   :10.82   Mean   : 3.40    Mean   : 25.81  
##  3rd Qu.: 300.0   3rd Qu.:13.91   3rd Qu.: 5.50    3rd Qu.: 26.64  
##  Max.   :5351.0   Max.   :31.14   Max.   :11.13    Max.   :250.78  
##                   NA's   :10965   NA's   :169657   NA's   :52692   
##      O2Sat       
##  Min.   : -0.10  
##  1st Qu.: 21.30  
##  Median : 54.50  
##  Mean   : 57.23  
##  3rd Qu.: 97.70  
##  Max.   :214.10  
##  NA's   :204578

Since it is really big, and since the final goal is to evaluate data imputation techniques, let’s take just a subset of \(10000\) rows, without including any missing values.
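The subsampling step can be sketched as follows (a minimal NumPy illustration on toy data; the variable names are assumptions, not the actual code used for the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data matrix with ~20% missing entries
data = rng.normal(size=(300, 5))
data[rng.random(data.shape) < 0.2] = np.nan

# keep only complete rows, then sample a fixed number without replacement
complete = data[~np.isnan(data).any(axis=1)]
subset = complete[rng.choice(complete.shape[0], size=50, replace=False)]
```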

We can now explore our final dataset; let’s start by looking at how the features interact among themselves:

As we could have expected, O2Sat and O2ml_L are strongly linearly correlated, and we can also identify a negative correlation between T_degC and STheta. While STheta and T_degC “could” be modelled as normal distributions, the distributions of O2ml_L and O2Sat are highly non-Gaussian (and very similar to each other).

We can also see strong non-linear correlations between Depthm and the other features; this suggests a particular relevance of this variable. As already said, these measurements are not “random” but rather planned in the design of the experiment. Its distribution looks exponential; let’s take a closer look at it along with a simple estimate of its underlying distribution:

Since this variable is not really random, it may be interesting to look at the conditional distributions of the other features given the depth. Here are the distributions at 4 different depths: 0, 10, 100 and 1000 m:

As we can see, the Gaussianity assumption on the other features is much more reasonable when conditioning on Depthm; this will be useful later on.

The How - Models and techniques

The proportion of absent values for some of these features is really high; my goal will be to take a subset of the data without missing values, mask some values, and try to impute them in different ways. To do this, let’s extract a random sample from the full dataset to get a sample from the distribution of the NA values; this is used to create the mask.

## Distribution of the number of missing values in each row
## 
##    0    2    3    4 
## 7533 2328  134    5
## 
## Number of missing values in each row
## Depthm T_degC O2ml_L STheta  O2Sat 
##      0     39   2293    264   2462

In creating this mask, we removed the rows with 4 missing values, as there are just 5 of them and it is unrealistic to impute so many values. As we can see, there are lots of missing values in the O2 columns, and most masked rows have exactly 2 missing values.
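The idea of transferring empirical NA patterns onto the complete subset can be sketched like this (NumPy, with toy data and assumed names):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy "full" dataset with natural missingness
full = rng.normal(size=(500, 4))
full[rng.random(full.shape) < 0.25] = np.nan

# toy complete subset to be masked
complete = rng.normal(size=(200, 4))

# sample NA patterns from the full data, dropping the all-missing ones
patterns = np.isnan(full)
patterns = patterns[patterns.sum(axis=1) < 4]
mask = patterns[rng.choice(patterns.shape[0], size=complete.shape[0])]

masked = complete.copy()
masked[mask] = np.nan   # the hidden true values stay available for evaluation
```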

To evaluate the goodness of our imputation, we will both visually compare the distributions and correlations of the features with those obtained on the full dataset (shown above), and compute the MSE of the imputed data with respect to the real values. To make the results easier to interpret, the data is normalized here to have zero mean and unit variance.
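The normalization is the usual column-wise standardization; as a quick sketch (NumPy, toy data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))   # toy unscaled features

# column-wise standardization: zero mean, unit variance
z = (x - x.mean(axis=0)) / x.std(axis=0)
```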

##      Depthm            T_degC             O2ml_L            STheta        
##  Min.   :-0.7237   Min.   :-2.23678   Min.   :-1.6570   Min.   :-3.96337  
##  1st Qu.:-0.5620   1st Qu.:-0.73360   1st Qu.:-0.9528   1st Qu.:-0.83092  
##  Median :-0.3112   Median :-0.17615   Median : 0.0233   Median : 0.16826  
##  Mean   : 0.0000   Mean   : 0.00036   Mean   : 0.0008   Mean   : 0.00147  
##  3rd Qu.: 0.2332   3rd Qu.: 0.73147   3rd Qu.: 1.0140   3rd Qu.: 0.81771  
##  Max.   :13.7022   Max.   : 4.47157   Max.   : 2.3834   Max.   : 1.97289  
##                    NA's   :39         NA's   :2293      NA's   :264       
##      O2Sat        
##  Min.   :-1.5420  
##  1st Qu.:-0.9564  
##  Median :-0.0766  
##  Mean   :-0.0005  
##  3rd Qu.: 1.0879  
##  Max.   : 2.3079  
##  NA's   :2462

Model 1: the story of a naive frequentist

I started with the simplest possible approach to data imputation: substituting each missing value with the mean of the corresponding column. This “frequentist” approach is really simple; we’ll use it as a baseline against which to compare the other methods.
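A sketch of mean imputation on standardized data, which also shows why the MSE lands near the column variance of \(1\) (NumPy, toy data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal((1000, 4))     # standardized "complete" data
mask = rng.random(z.shape) < 0.2       # entries to hide

observed = z.copy()
observed[mask] = np.nan

# mean imputation: replace each NA with the mean of its observed column
col_means = np.nanmean(observed, axis=0)
imputed = np.where(np.isnan(observed), col_means, observed)

# the imputed entries miss by roughly one standard deviation on average
mse = np.mean((imputed[mask] - z[mask]) ** 2)
```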

As we can see, when the number of NA values is small, the distribution is almost unchanged, but the effect of this imputation is clearly visible in the scatter plots. When the number of NA values is large, the distribution itself is also heavily affected by the averages we added.

Let’s see how it performs with respect to the “true” masked values.

## MSE of the imputed data wrt the correct ones:
##   Overall    T_degC    O2ml_L    STheta     O2Sat 
## 1.0088166 0.9217716 1.0119767 0.9966549 1.0084843

As expected, since the data is standardized, this imputation method yields an MSE of about \(1\) (the columns with fewer NAs deviate from \(1\) more noticeably).

Model 2: it was a Multivariate Normal all along

The first data imputation strategy consists in modelling the data as if it came from a multivariate normal distribution; the normality assumption is not really appropriate for these variables, so this approach is not optimal for this dataset.

Three chains are run in this setup, each with a random set of prior means and variances (distributed, respectively, as a standard Normal and as a Gamma). The prior correlations between the variables are all set to \(0\), and the missing values are initialized to the column averages.
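A JAGS specification along these lines would express this model (an illustrative sketch, not the actual code: hyperparameters and names are assumptions; JAGS treats the NA entries of X as unknown nodes and samples them):

```
MVNmodel = "
  model {
    #Likelihood: each row is multivariate normal (Omega is a precision matrix);
    #the NA entries of X are treated as parameters and imputed by the sampler
    for (i in 1:n) {
      X[i, 1:p] ~ dmnorm(mu[], Omega[,])
    }

    #Priors (mu0 and the scale matrix R are passed in as data)
    for (j in 1:p) {
      mu[j] ~ dnorm(mu0[j], 0.01)
    }
    Omega[1:p, 1:p] ~ dwish(R[,], p + 1)
  }
"
```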

Convergence diagnostics

Let’s look at some diagnostics. Since we have lots of missing data, we cannot inspect each chain individually for convergence; we will rely instead on scalar measures. In particular, here you can see the distribution of the potential scale reduction factor for the 3 sets of parameters:
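As a reminder of what this diagnostic computes, here is a minimal sketch of the potential scale reduction factor (NumPy; a simplified version that omits the degrees-of-freedom correction used by standard implementations):

```python
import numpy as np

def psrf(chains):
    """Simplified Gelman-Rubin potential scale reduction factor.

    chains: array of shape (m, n) -- m parallel chains of n samples each.
    """
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(4)
mixed = rng.standard_normal((3, 5000))            # chains on the same target
stuck = mixed + np.array([[0.0], [0.0], [5.0]])   # one chain far away

# psrf(mixed) is close to 1, psrf(stuck) is much larger
```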

And here you can see the distribution of the Geweke statistic for each chain in each set of parameters:

Both statistics look pretty good. On the X.MISS category you can observe a superimposed standard normal; since we computed so many statistics, what matters is not that every value falls in \([-1.96, 1.96]\), but rather that the overall distribution is not too far from a standard normal.
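The Geweke statistic compares the mean of the early part of a chain with the mean of the late part; a naive sketch (NumPy; the real statistic uses spectral variance estimates to account for autocorrelation):

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Naive Geweke z-score comparing early vs late segment means.

    Uses plain sample variances; the proper statistic replaces them with
    spectral density estimates to handle autocorrelated samples.
    """
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1.0 - last) * n):]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(5)
stationary = rng.standard_normal(5000)                       # converged-looking
drifting = np.linspace(0.0, 3.0, 5000) + rng.standard_normal(5000)

# for a stationary chain the score behaves like a standard normal draw;
# a drifting chain produces a score far outside [-1.96, 1.96]
```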

Performance evaluation

We now want to infer the final imputed values to compare them with the other models. These values are computed as the average of the samples drawn for each missing point.

The plots look a lot better than in the frequentist case. The correlations still have some problems but they are much less noticeable.

Let’s see how it performs with respect to the “true” masked values.

## MSE of the imputed data wrt the correct ones:
##   Overall    T_degC    O2ml_L    STheta     O2Sat 
## 0.3686598 0.4731546 0.4085384 0.2643438 0.3357546

This imputation method is a big improvement over the frequentist approach in terms of MSE. It’s interesting to note that the smallest value does not belong to the column with the fewest NA values: the method exploits the correlations between features, not just the amount of observed data.

Model 3: maybe Depth is important after all…

In this third attempt I decided to leverage the fact that the distributions of the features conditional on Depthm can be much more reasonably approximated with a Gaussian, so this approach consists in binning the depth variable and running the previous Multivariate Normal approach independently on each of the bins. This method mimics the use of the conditional distribution of the other variables given Depthm.

The bins for this variable have been chosen by hand, with a higher density for smaller values and a lower density for higher values, in such a way that the number of samples in each bin is roughly constant. In this plot you can see how the \(25\) bins were chosen:
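An automatic alternative giving roughly constant counts would be quantile-based bin edges (a NumPy sketch on a toy exponential-looking variable; the bins in the analysis were picked by hand):

```python
import numpy as np

rng = np.random.default_rng(6)
depth = rng.exponential(scale=300.0, size=10000)   # toy depth-like variable

n_bins = 25
# edges at equally spaced quantiles -> denser bins where the data is denser
edges = np.quantile(depth, np.linspace(0.0, 1.0, n_bins + 1))
bins = np.clip(np.searchsorted(edges, depth, side="right") - 1, 0, n_bins - 1)

counts = np.bincount(bins, minlength=n_bins)   # roughly 400 samples per bin
```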

## [1] "Percentage of NA values for each column per each BIN:"
##      Depthm    T_degC   O2ml_L   STheta    O2Sat Number of samples
## 0         0 0.3809524 22.28571 4.000000 24.57143               525
## 10       10 0.0000000 23.91304 2.173913 25.36232               552
## 20       20 0.7067138 23.67491 2.826855 25.44170               566
## 30       30 0.8756567 21.54116 4.028021 23.81786               571
## 40       40 0.5235602 21.98953 3.664921 24.60733               191
## 50       50 0.5474453 22.99270 2.554745 24.63504               548
## 60       60 0.0000000 21.92513 3.743316 24.06417               187
## 70       70 0.3676471 22.97794 2.573529 24.08088               544
## 80       80 0.6622517 22.51656 4.635762 26.49007               151
## 90       90 0.0000000 21.15385 2.564103 23.07692               156
## 100     100 0.1908397 24.80916 2.671756 25.95420               524
## 115     115 0.3430532 22.98456 2.229846 24.87136               583
## 135     135 0.0000000 17.68293 2.439024 18.90244               164
## 155     155 0.0000000 24.10546 1.506591 25.04708               531
## 175     175 0.0000000 19.04762 1.190476 19.64286               168
## 195     195 0.6993007 24.94172 3.263403 27.73893               429
## 215     215 0.6622517 19.86755 2.649007 21.85430               151
## 235     235 0.8695652 21.73913 3.478261 25.21739               115
## 295     295 0.3748828 22.11809 1.780694 23.33646              1067
## 395     395 0.1605136 21.82986 2.407705 23.59551               623
## 495     495 0.4966887 24.66887 2.317881 25.82781               604
## 695     695 0.1709402 24.44444 2.051282 25.47009               585
## 922     922 0.9433962 25.47170 3.301887 27.35849               212
## 1500   1500 1.0416667 21.35417 3.645833 24.47917               192
## 3000   3000 0.0000000 19.67213 3.278689 22.95082                61

Let’s also have a look at the correlations between the variables inside some of the bins:

This approach is probably not well suited for this dataset, given how many NA rows there are in each bin. Some conditional correlations are reasonable, while others are still clearly nonlinear.

As in the previous model, three chains are run in each bin, each with a random set of prior means and variances (distributed, respectively, as a standard Normal and as a Gamma). The prior correlations between the variables are all set to \(0\), and the missing values are initialized to the column averages.

Convergence diagnostics

Let’s look at some diagnostics. Here you can see the distribution of the potential scale reduction factor for the 3 sets of parameters in each chain:

The chains corresponding to THETA show good indication of convergence, while the chains corresponding to SIGMA have trouble converging; let’s look in more detail at one of the bins, the one at \(695\ m\) of depth. We can easily see the cause of the big reduction factor: the three chains are completely off.

## [1] "PSRF for Depth = 695 m"
##          [,1]      [,2]     [,3]     [,4]
## [1,] 7.579418  1.082410 1.005719 1.030329
## [2,] 1.082410 12.246040 1.136825 1.059646
## [3,] 1.005719  1.136825 6.257068 1.040944
## [4,] 1.030329  1.059646 1.040944 4.502493
## [1] "Means of the 3 chains for the variance of the O2Concentration:"
## [1] 0.01535362 0.02519422 0.05228679

This particular problem is due to the inadequacy of the model: the bin corresponding to \(695\ m\) still has highly nonlinear correlations.

And here you can see the distribution of the Geweke statistic for each chain in each set of parameters. I decided to pool the values of the three chains to make the plot more readable:

The values for the parameters here are pretty good, so the individual chains probably each reached a stable point; the problems clearly lie in the model.

Performance evaluation

We now want to infer the final imputed values to compare them with the other models. These values are computed as the average of the samples drawn for each missing point.

The individual distributions are pretty close, but the correlations between the variables are really off. It’s peculiar that the two most linearly correlated variables end up with the most distorted correlations.

Let’s see how it performs with respect to the “true” masked values.

## MSE of the imputed data wrt the correct ones:
##  Overall   T_degC   O2ml_L   STheta    O2Sat 
## 1.187493 1.515858 1.244134 1.189666 1.125721

As we can see, the MSE is worse than in the frequentist case! This is probably due to the large number of NA values relative to the total number of data points in each bin.

Model 4: a good old regression

As a final attempt, recall that the correlations between Depthm and the other variables were quite clear; let’s plot them here:

These plots are reminiscent of some basic functional forms of the type: \[ A(1-e^{-Bx}) + C\qquad\qquad\qquad -Axe^{-Bx}+C \] with parameters \(A,B,C\). The idea is then to impute each missing point using only the Depthm column, by fitting the above functional forms to the data.

Here is an example of these two functional shapes for a suitable set of parameters \(A,B,C\):
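The two shapes can be evaluated directly; as a sketch (NumPy, with arbitrary illustrative parameters):

```python
import numpy as np

def saturating(x, A, B, C):
    """A*(1 - exp(-B*x)) + C: monotone, levelling off at A + C."""
    return A * (1.0 - np.exp(-B * x)) + C

def dip(x, A, B, C):
    """-A*x*exp(-B*x) + C: for A > 0, dips to a minimum at x = 1/B,
    then relaxes back towards C."""
    return -A * x * np.exp(-B * x) + C

x = np.linspace(0.0, 10.0, 200)
y1 = saturating(x, A=-2.0, B=1.5, C=1.0)   # decreasing from C towards A + C
y2 = dip(x, A=3.0, B=1.0, C=0.5)           # minimum at x = 1/B = 1
```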

For each column three chains are run, each with a random set of priors for \(A,B,C\). Let’s start with the temperature column; we will cheat and look at the plots to infer the most suitable prior distributions for the parameters. We need:

* \(A\) to be negative (as the function should decrease), so we set \(-A\sim Gamma(1,1)\)
* \(B\) to be positive, so we set \(B\sim Gamma(1,1)\)
* \(C\) to be positive, so we set \(C\sim Gamma(1,1)\)

The Jags model used is:

TempReg = "
  model {
    #Likelihood
    for (i in 1:n) {
      y[i]~dnorm(mu[i],tau)
      mu[i] <- A*(1-exp(-B*X[i])) + C
    }
    
    #Priors
    Aneg ~ dgamma(1,1)
    B ~ dgamma(1,1)
    C ~ dgamma(1,1)
    A <- -Aneg
    tau <- 1 / (sigma * sigma)
    sigma~dunif(0,100)
  }
"

Three parallel chains are used for this model, with thinning of 10 points:

Similarly, for STheta we want:

* \(A\) to be positive (as the function should increase), so we set \(A\sim Gamma(1,1)\)
* \(B\) to be positive, so we set \(B\sim Gamma(1,1)\)
* \(C\) to be negative, so we set \(-C\sim Gamma(1,1)\)

The Jags model used is:

SThetaReg = "
  model {
    #Likelihood
    for (i in 1:n) {
      y[i]~dnorm(mu[i],tau)
      mu[i] <- A*(1-exp(-B*X[i])) + C
    }
    
    #Priors
    A ~ dgamma(1,1)
    B ~ dgamma(1,1)
    Cneg ~ dgamma(1,1)
    C <- -Cneg
    tau <- 1 / (sigma * sigma)
    sigma~dunif(0,100)
  }
"

For O2con we want:

* \(A\) to be positive (as the function should initially decrease), so we set \(A\sim Gamma(1,1)\)
* \(B\) to be positive, so we set \(B\sim Gamma(1,1)\)
* \(C\) to be negative, so we set \(-C\sim Gamma(1,1)\)

The Jags model used is:

O2conReg = "
  model {
    #Likelihood
    for (i in 1:n) {
      y[i]~dnorm(mu[i],tau)
      mu[i] <- -A*X[i] * exp(-B*X[i]) + C
    }
    
    #Priors
    A ~ dgamma(1,1)
    B ~ dgamma(1,1)
    Cneg ~ dgamma(1,1)
    C <- -Cneg
    tau <- 1 / (sigma * sigma)
    sigma~dunif(0,100)
  }
"

Similarly, for O2Sat we want:

* \(A\) to be positive (as the function should initially decrease), so we set \(A\sim Gamma(1,1)\)
* \(B\) to be positive, so we set \(B\sim Gamma(1,1)\)
* \(C\) to be negative, so we set \(-C\sim Gamma(1,1)\)

The Jags model used is:

O2satReg = "
  model {
    #Likelihood
    for (i in 1:n) {
      y[i]~dnorm(mu[i],tau)
      mu[i] <- -A*X[i] * exp(-B*X[i]) + C
    }
    
    #Priors
    A ~ dgamma(1,1)
    B ~ dgamma(1,1)
    Cneg ~ dgamma(1,1)
    C <- -Cneg
    tau <- 1 / (sigma * sigma)
    sigma~dunif(0,100)
  }
"

Here we plot the inferred curves over the training data to check that nothing went wrong:

The fit was successful. A relationship of the type \(\frac{A}{x} + B\) would probably suit temperature and density better, while the oxygen variables were fitted pretty well. The main bottleneck here is that the variance is constant; if we had modelled the variance with a functional form as well, the regression would probably be a bit better.
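A depth-dependent variance could be expressed in JAGS along these lines (an illustrative sketch, not code that was run; the log-linear form for sigma and its priors are assumptions):

```
HeteroReg = "
  model {
    #Likelihood with depth-dependent standard deviation
    for (i in 1:n) {
      y[i] ~ dnorm(mu[i], tau[i])
      mu[i] <- A*(1-exp(-B*X[i])) + C
      tau[i] <- 1 / (sigma[i] * sigma[i])
      log(sigma[i]) <- s0 + s1 * X[i]
    }

    #Priors (A, B, C as in the temperature model above)
    Aneg ~ dgamma(1,1)
    A <- -Aneg
    B ~ dgamma(1,1)
    C ~ dgamma(1,1)
    s0 ~ dnorm(0, 0.01)
    s1 ~ dnorm(0, 0.01)
  }
"
```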

Convergence diagnostics

Let’s look at some diagnostics. Since we have a small set of parameters here, we can inspect convergence in a bit more detail:

## Potential Scale Reduction factor for the parameters:
##         A         B         C  deviance     sigma 
## 0.9994715 1.0001227 1.0010581 1.0018672 0.9997905
## 
## Geweke statistics for the three chains:
##             [,1]    [,2]    [,3]
## A        -0.5181  0.7753 -0.7358
## B        -0.4392  0.7113 -0.6357
## C         0.8192 -0.3644  0.2185
## deviance  1.4890 -0.6198 -0.9175
## sigma     1.1857 -0.8126 -1.7507

## Potential Scale Reduction factor for the parameters:
##         A         B         C  deviance     sigma 
## 1.0192370 1.0215139 1.0053824 1.0005324 0.9991134
## 
## Geweke statistics for the three chains:
##             [,1]    [,2]    [,3]
## A        -0.2161  1.2339 -0.3778
## B         0.7188 -1.0582  0.7060
## C         0.2223 -1.4663 -0.1178
## deviance  0.4967 -0.3991 -0.9836
## sigma    -0.4950 -1.8632  0.0079

## Potential Scale Reduction factor for the parameters:
##         A         B         C  deviance     sigma 
## 0.9992895 1.0079675 1.0034907 1.0107078 1.0019550
## 
## Geweke statistics for the three chains:
##             [,1]    [,2]    [,3]
## A        -0.3054 -0.1576  0.2276
## B        -0.9390  0.9022  0.4825
## C         0.9762 -0.1730  0.2621
## deviance  1.2127 -0.1050  0.1832
## sigma    -0.0122 -0.7291 -0.1548

## Potential Scale Reduction factor for the parameters:
##         A         B         C  deviance     sigma 
## 1.0029377 1.0039471 1.0050031 0.9987128 0.9990846
## 
## Geweke statistics for the three chains:
##             [,1]    [,2]    [,3]
## A         2.3044 -0.2490  0.1830
## B        -1.3929 -0.9114 -0.8478
## C        -0.1011  0.3630  0.3456
## deviance -0.7176  1.2430  1.3236
## sigma    -0.6328 -2.3196  0.0524

The chains seem to have reached convergence. They would probably benefit from a longer run to eliminate some of the effects we see here.

Performance evaluation

We now want to infer the final imputed values to compare them with the other models. These values are computed as the average of the samples drawn for each missing point.

Visually, both the distributions and the correlations are pretty good. Let’s see how it performs with respect to the “true” masked values.

## MSE of the imputed data wrt the correct ones:
##   Overall    T_degC    O2ml_L    STheta     O2Sat 
## 0.4087890 0.5230797 0.4104546 0.5571087 0.3857311

This imputation method is a big improvement over the frequentist approach in terms of MSE. It is not quite as good as the stock multivariate normal, but it gets close. A combination of these methods could probably produce even better results.